Overview

Dataset statistics

Number of variables: 3
Number of observations: 125438
Missing cells: 0
Missing cells (%): 0.0%
Duplicate rows: 0
Duplicate rows (%): 0.0%
Total size in memory: 2.9 MiB
Average record size in memory: 24.0 B
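
The two memory figures above are consistent with each other. As a rough check (a sketch only; the variable names are illustrative and not part of the report), multiplying the average record size by the number of observations gives the total size:

```python
# Figures copied from the dataset statistics above.
n_rows = 125438
record_bytes = 24.0

# 1 MiB = 2**20 bytes.
total_mib = n_rows * record_bytes / 2**20
print(f"{total_mib:.1f} MiB")
```

A 24-byte record would fit three 8-byte columns, but that breakdown is an assumption; the report does not state it.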

Variable types

Categorical: 1
Numeric: 2

Alerts

tconst has a high cardinality: 125438 distinct values (High cardinality)
numVotes is highly skewed (γ1 = 40.6086984) (Skewed)
tconst is uniformly distributed (Uniform)
tconst has unique values (Unique)
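
To make the skewness alert concrete, here is a minimal sketch of the moment-based skewness γ1 = m3 / m2^1.5. It uses the population form with no bias correction, so it may differ slightly from the estimator the report uses. The function name and the sample values are hypothetical.

```python
def skewness(values):
    """Population skewness g1 = m3 / m2**1.5 (no bias correction)."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n  # second central moment
    m3 = sum((v - mean) ** 3 for v in values) / n  # third central moment
    return m3 / m2 ** 1.5

# Hypothetical vote counts: many small values and one very large one.
votes = [5, 6, 7, 8, 9, 10, 12, 26, 100, 100000]
# A symmetric sample for comparison.
symmetric = [1, 2, 3, 4, 5]

print(skewness(votes), skewness(symmetric))
```

A long right tail, as in numVotes, pushes γ1 well above zero, while a symmetric sample gives a value of zero.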

Reproduction

Analysis started: 2022-11-22 21:00:25.305250
Analysis finished: 2022-11-22 21:03:09.486625
Duration: 2 minutes and 44.18 seconds
Software version: pandas-profiling vdev
Download configuration: config.json
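
The duration follows directly from the two timestamps. A small standard-library sketch (variable names are illustrative):

```python
from datetime import datetime

# Timestamps copied from the Reproduction block above.
started = datetime.fromisoformat("2022-11-22 21:00:25.305250")
finished = datetime.fromisoformat("2022-11-22 21:03:09.486625")

elapsed = finished - started
minutes, seconds = divmod(elapsed.total_seconds(), 60)
print(f"{int(minutes)} minutes and {seconds:.2f} seconds")
```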

Variables

tconst
Categorical

HIGH CARDINALITY
UNIFORM
UNIQUE

Distinct: 125438
Distinct (%): 100.0%
Missing: 0
Missing (%): 0.0%
Memory size: 980.1 KiB

Value                    Count
tt0000003                1
tt2188056                1
tt21879424               1
tt2187932                1
tt21879178               1
Other values (125433)    125433

Length

Max length: 10
Median length: 9
Mean length: 9.1506242
Min length: 9
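
Because the only lengths present are 9 and 10, the mean length also reveals how many identifiers have 10 characters. A short worked sketch (variable names are illustrative):

```python
# Figures copied from this report.
n = 125438
mean_length = 9.1506242

# With only lengths 9 and 10 present, mean = 9 + (share of 10-character ids).
ten_char_ids = round(n * (mean_length - 9))

# Every id contributes 9 characters; the longer ones contribute one more.
total_chars = 9 * n + ten_char_ids
print(ten_char_ids, total_chars)
```

The resulting character total agrees with the "Total characters" figure reported for this variable.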

Characters and Unicode

Total characters: 1147836
Distinct characters: 11
Distinct categories: 2
Distinct scripts: 2
Distinct blocks: 1
The Unicode Standard assigns character properties to each code point, which can be used to analyse textual variables.
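
As an illustration of those character properties, Python's standard `unicodedata` module exposes the general category of each code point. The sketch below counts categories over the five sample identifiers listed in this report. The module does not expose scripts or blocks, so only categories are shown.

```python
import unicodedata
from collections import Counter

# The five sample identifiers listed in this report.
ids = ["tt0000003", "tt0000033", "tt0000035", "tt0000040", "tt0000041"]

# Count characters by Unicode general category:
# "Ll" = Lowercase Letter, "Nd" = Decimal Number.
categories = Counter(unicodedata.category(ch) for s in ids for ch in s)
print(categories)
```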

Unique

Unique: 125438
Unique (%): 100.0%

Sample

1st row: tt0000003
2nd row: tt0000033
3rd row: tt0000035
4th row: tt0000040
5th row: tt0000041

Common Values

Value                    Count     Frequency (%)
tt0000003                1         < 0.1%
tt2188056                1         < 0.1%
tt21879424               1         < 0.1%
tt2187932                1         < 0.1%
tt21879178               1         < 0.1%
tt21878904               1         < 0.1%
tt2187889                1         < 0.1%
tt2187887                1         < 0.1%
tt2187884                1         < 0.1%
tt2187815                1         < 0.1%
Other values (125428)    125428    > 99.9%

Length

Histogram of lengths of the category
Value                    Count     Frequency (%)
tt0000003                1         < 0.1%
tt0000035                1         < 0.1%
tt0000041                1         < 0.1%
tt0000045                1         < 0.1%
tt0000046                1         < 0.1%
tt0000060                1         < 0.1%
tt0000066                1         < 0.1%
tt0000067                1         < 0.1%
tt0000069                1         < 0.1%
tt0000073                1         < 0.1%
Other values (125428)    125428    > 99.9%

Most occurring characters

Value    Count     Frequency (%)
t        250876    21.9%
0        132397    11.5%
1        111959    9.8%
2        95976     8.4%
4        88705     7.7%
6        85945     7.5%
8        82876     7.2%
3        79199     6.9%
5        77060     6.7%
7        73393     6.4%

Most occurring categories

Value               Count     Frequency (%)
Decimal Number      896960    78.1%
Lowercase Letter    250876    21.9%

Most frequent character per category

Decimal Number
Value    Count     Frequency (%)
0        132397    14.8%
1        111959    12.5%
2        95976     10.7%
4        88705     9.9%
6        85945     9.6%
8        82876     9.2%
3        79199     8.8%
5        77060     8.6%
7        73393     8.2%
9        69450     7.7%

Lowercase Letter
Value    Count     Frequency (%)
t        250876    100.0%

Most occurring scripts

Value     Count     Frequency (%)
Common    896960    78.1%
Latin     250876    21.9%

Most frequent character per script

Common
Value    Count     Frequency (%)
0        132397    14.8%
1        111959    12.5%
2        95976     10.7%
4        88705     9.9%
6        85945     9.6%
8        82876     9.2%
3        79199     8.8%
5        77060     8.6%
7        73393     8.2%
9        69450     7.7%

Latin
Value    Count     Frequency (%)
t        250876    100.0%

Most occurring blocks

Value    Count      Frequency (%)
ASCII    1147836    100.0%

Most frequent character per block

ASCII
Value    Count     Frequency (%)
t        250876    21.9%
0        132397    11.5%
1        111959    9.8%
2        95976     8.4%
4        88705     7.7%
6        85945     7.5%
8        82876     7.2%
3        79199     6.9%
5        77060     6.7%
7        73393     6.4%

averageRating
Real number (ℝ)

Distinct: 91
Distinct (%): 0.1%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 6.9464819
Minimum: 1
Maximum: 10
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 980.1 KiB

Quantile statistics

Minimum: 1
5-th percentile: 4.4
Q1: 6.2
Median: 7.1
Q3: 7.9
95-th percentile: 8.9
Maximum: 10
Range: 9
Interquartile range (IQR): 1.7

Descriptive statistics

Standard deviation: 1.3901457
Coefficient of variation (CV): 0.20012227
Kurtosis: 1.117804
Mean: 6.9464819
Median Absolute Deviation (MAD): 0.8
Skewness: -0.80419873
Sum: 871352.8
Variance: 1.9325051
Monotonicity: Not monotonic
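
Several of these figures are derived from others, which gives a quick consistency check. A sketch using the values reported for averageRating (variable names are illustrative):

```python
import math

# Figures copied from the averageRating summary.
mean, std = 6.9464819, 1.3901457
q1, q3 = 6.2, 7.9
minimum, maximum = 1, 10

cv = std / mean                  # coefficient of variation
variance = std ** 2              # variance is the squared standard deviation
iqr = q3 - q1                    # interquartile range
value_range = maximum - minimum  # range

# Compare against the reported values, allowing for rounding in the report.
print(math.isclose(cv, 0.20012227, rel_tol=1e-6))
print(math.isclose(variance, 1.9325051, rel_tol=1e-6))
print(math.isclose(iqr, 1.7), value_range == 9)
```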
Histogram with fixed size bins (bins=50)
Most frequent values
Value                Count    Frequency (%)
7.4                  4599     3.7%
7.2                  4522     3.6%
7.6                  4423     3.5%
7.8                  4393     3.5%
7                    4313     3.4%
7.7                  4048     3.2%
8                    4025     3.2%
7.5                  3977     3.2%
7.3                  3965     3.2%
6.8                  3913     3.1%
Other values (81)    83260    66.4%

Smallest values
Value    Count    Frequency (%)
1        110      0.1%
1.1      22       < 0.1%
1.2      56       < 0.1%
1.3      28       < 0.1%
1.4      47       < 0.1%
1.5      30       < 0.1%
1.6      45       < 0.1%
1.7      37       < 0.1%
1.8      48       < 0.1%
1.9      56       < 0.1%

Largest values
Value    Count    Frequency (%)
10       610      0.5%
9.9      107      0.1%
9.8      261      0.2%
9.7      226      0.2%
9.6      374      0.3%
9.5      330      0.3%
9.4      514      0.4%
9.3      511      0.4%
9.2      860      0.7%
9.1      745      0.6%

numVotes
Real number (ℝ)

Distinct: 5577
Distinct (%): 4.4%
Missing: 0
Missing (%): 0.0%
Infinite: 0
Infinite (%): 0.0%
Mean: 939.27737
Minimum: 5
Maximum: 1326662
Zeros: 0
Zeros (%): 0.0%
Negative: 0
Negative (%): 0.0%
Memory size: 980.1 KiB

Quantile statistics

Minimum: 5
5-th percentile: 6
Q1: 12
Median: 26
Q3: 100
95-th percentile: 1316
Maximum: 1326662
Range: 1326657
Interquartile range (IQR): 88

Descriptive statistics

Standard deviation: 14122.848
Coefficient of variation (CV): 15.035865
Kurtosis: 2281.1966
Mean: 939.27737
Median Absolute Deviation (MAD): 18
Skewness: 40.608698
Sum: 1.1782108 × 10^8
Variance: 1.9945483 × 10^8
Monotonicity: Not monotonic
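
The sum and variance are reported in scientific notation (powers of ten). As a rough check, the sum should equal the mean times the number of observations, and the variance should equal the squared standard deviation. The sketch below uses the rounded figures from this report, so agreement is only approximate; variable names are illustrative.

```python
import math

# Figures copied from this report.
n = 125438          # number of observations
mean = 939.27737    # mean of numVotes
std = 14122.848     # standard deviation of numVotes

total = mean * n       # should be close to the reported sum
variance = std ** 2    # should be close to the reported variance

print(f"{total:.4e}", f"{variance:.4e}")
```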
Histogram with fixed size bins (bins=50)
Most frequent values
Value                  Count    Frequency (%)
8                      5220     4.2%
7                      5064     4.0%
6                      4855     3.9%
9                      4832     3.9%
10                     4317     3.4%
11                     3902     3.1%
12                     3644     2.9%
13                     3285     2.6%
14                     3076     2.5%
5                      2978     2.4%
Other values (5567)    84265    67.2%

Smallest values
Value    Count    Frequency (%)
5        2978     2.4%
6        4855     3.9%
7        5064     4.0%
8        5220     4.2%
9        4832     3.9%
10       4317     3.4%
11       3902     3.1%
12       3644     2.9%
13       3285     2.6%
14       3076     2.5%

Largest values
Value      Count    Frequency (%)
1326662    1        < 0.1%
1167229    1        < 0.1%
1053088    1        < 0.1%
827972     1        < 0.1%
819891     1        < 0.1%
770703     1        < 0.1%
754833     1        < 0.1%
718854     1        < 0.1%
692508     1        < 0.1%
659750     1        < 0.1%

Interactions

Correlations

Auto

The auto setting picks an interpretable pairwise metric for each pair of columns, according to the following mapping (variable type pair : method, range):
  • Categorical-Categorical : Cramér's V, [0,1]
  • Numerical-Categorical : Cramér's V, [0,1] (using a discretized numerical column)
  • Numerical-Numerical : Spearman's ρ, [-1,1]
The number of bins used to discretize the numerical column in a Numerical-Categorical pair can be changed using config.correlations["auto"].n_bins. The number of bins sets the granularity of the association being measured.

This configuration uses the recommended metric for each pair of columns.
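
The mapping above can be read as a simple lookup keyed on the pair of variable types. The sketch below is only an illustration of that reading: `AUTO_METRICS` and `auto_metric` are hypothetical names, not part of pandas-profiling, and the code selects a method name without computing any correlation.

```python
# Mapping from the list above: (type, type) -> (method, range).
AUTO_METRICS = {
    ("Categorical", "Categorical"): ("Cramer's V", (0, 1)),
    # For this pair the numerical column is discretized into bins first.
    ("Numerical", "Categorical"): ("Cramer's V", (0, 1)),
    ("Numerical", "Numerical"): ("Spearman's rho", (-1, 1)),
}

def auto_metric(type_a, type_b):
    """Return (method, range) for a column pair, regardless of argument order."""
    key = (type_a, type_b)
    if key not in AUTO_METRICS:
        key = (type_b, type_a)
    return AUTO_METRICS[key]

# In this dataset: tconst is categorical, averageRating and numVotes are numerical.
print(auto_metric("Categorical", "Numerical"))
print(auto_metric("Numerical", "Numerical"))
```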

Phik (φk)

Phik (φk) is a practical correlation coefficient that works consistently across categorical, ordinal, and interval variables, captures non-linear dependency, and reverts to the Pearson correlation coefficient for a bivariate normal input distribution. Extensive documentation is available for it.

Missing values

A simple visualization of nullity by column.
The nullity matrix is a data-dense display that lets you quickly pick out patterns in data completion.
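
To show what a nullity matrix encodes, here is a minimal sketch. The rows are hypothetical: they are shaped like this dataset, but two cells are deliberately set to None for illustration. The profiled data itself has no missing cells.

```python
# Hypothetical rows shaped like this dataset; None marks a missing cell.
rows = [
    {"tconst": "tt0000003", "averageRating": 6.5, "numVotes": 1737},
    {"tconst": "tt0000033", "averageRating": None, "numVotes": 1002},
    {"tconst": "tt0000035", "averageRating": 5.5, "numVotes": None},
]
columns = ["tconst", "averageRating", "numVotes"]

# Nullity matrix: one row per record, True where the cell is present.
matrix = [[row[col] is not None for col in columns] for row in rows]

# Missing count per column, the "nullity by column" summary.
missing_per_column = {col: sum(row[col] is None for row in rows) for col in columns}

for line in matrix:
    print(line)
print(missing_per_column)
```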

Sample

          tconst       averageRating    numVotes
0         tt0000003    6.5              1737
1         tt0000033    5.5              1002
2         tt0000035    5.5              79
3         tt0000040    4.1              64
4         tt0000041    6.7              1787
5         tt0000045    3.9              33
6         tt0000046    4.1              34
7         tt0000060    7.6              87
8         tt0000066    3.0              29
9         tt0000067    5.6              62

          tconst       averageRating    numVotes
125428    tt9913636    7.4              31
125429    tt9913862    3.6              6
125430    tt9914156    6.4              163
125431    tt9914162    6.2              5
125432    tt9914392    4.9              14
125433    tt9914758    7.4              19
125434    tt9916120    5.3              12
125435    tt9916316    8.4              5
125436    tt9916576    6.1              21
125437    tt9916580    8.5              7